Week #2 Exercise Questions

Today's Session Outline:-

  1. Data Loading
    • textFile
    • parallelize
    • from Master in a distributed environment (Week 3)
  2. MapReduce
    • transformation & actions
      • map & collect
      • filter & count
      • flatMap & take
      • filter & reduce
      • union, intersection
      • join, leftOuterJoin, cartesian, cogroup
      • reduceByKey & collect
  3. Config
    • using external files (Week 3)
    • using API
    • Spark Session (discussed with Spark SQL)
  4. Partitions
    • fill it with random data
    • find/use the partitions
  5. Spark SQL
    • Basic operations

Things to remember:-

  • Know your hardware configuration: RAM, CPU cores, etc.
  • All Spark functions will be given along with the questions; you have to fill in each Spark function with its respective parameters
    and write the corresponding Scala or Python logic
  • This is a practice session, so no scores are calculated
  • For quicker programming, we will use the shell environment today
  • If your IDE configurations aren't working, approach us after the exercise session
  • Of course, solutions will be provided for these questions after the exercise session
  • If you are already familiar with the contents listed above, go ahead and start with Spark SQL

Configuration:-

https://github.com/shathil/BigDataExercises

From the above URL, download the Git repository as a ZIP file. From the downloaded ZIP package, extract the folder called resources and place it inside your Spark 2.x folder.

1. Data Loading

Question 1

  • In the Spark shell, the SparkContext object is already initialized and available as sc.
  • Now load a "textFile" named 'pg1567.txt' from the resources folder into an RDD named textRDD
  • Use 'resources/pg1567.txt' as the path when loading the text file.
  • Now transform textRDD with "map" and split each line into words (use a space as the delimiter).
  • Finally, count the number of words using "count" (a sketch of one possible answer follows the function list)

Spark functions needed:-

  • textFile
  • map
  • count
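
A minimal sketch of one possible answer, assuming the shell's pre-initialized sc and that the file sits at resources/pg1567.txt (adjust the name to whatever is in your resources folder):

In [ ]:
# One way to answer Question 1 (PySpark shell, `sc` is already available).
textRDD = sc.textFile('resources/pg1567.txt')          # RDD of lines
wordsRDD = textRDD.map(lambda line: line.split(' '))   # each element becomes a list of words
print(wordsRDD.count())                                # map keeps one element per line; flatMap would count individual words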

Question 2

  • Create a Scala array or a Python list containing any 5 numbers and store it in a variable called "numbers"
  • Use the Spark "parallelize" function on the numbers variable and store the resulting RDD as numRDD
  • Now transform numRDD with "map" by finding the squares of all the numbers, and store the result as numMap
  • Finally, collect the squared numbers using "collect"

Spark functions needed:-

  • parallelize
  • map
  • collect
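
A minimal sketch, assuming the shell's sc; the five numbers are arbitrary:

In [ ]:
# One way to answer Question 2.
numbers = [1, 2, 3, 4, 5]             # any 5 numbers
numRDD = sc.parallelize(numbers)      # distribute the local list as an RDD
numMap = numRDD.map(lambda x: x * x)  # square every element
print(numMap.collect())               # bring the squared numbers back to the driver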

2. MapReduce

Question 3

  • Generate 10,000 random numbers and store them in a variable called "numbers"
  • parallelize the "numbers" and store the result in "numRDD"
  • Now filter the positive numbers from numRDD, and
  • Count the positive numbers

Spark functions needed:-

  • parallelize
  • filter
  • count
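
A possible sketch, assuming Python's random module as the source of random numbers; any random source works:

In [ ]:
# One way to answer Question 3.
import random
numbers = [random.randint(-1000, 1000) for _ in range(10000)]  # 10,000 random numbers
numRDD = sc.parallelize(numbers)
positives = numRDD.filter(lambda x: x > 0)  # keep only the positive values
print(positives.count())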

Question 4

  • Now use flatMap on "numRDD" to find the squares of all the numbers and store the result as "squaredRDD"
  • Try printing the first element of squaredRDD using the Spark action "first"
  • Try printing the first 10 elements of squaredRDD using the Spark action "take"

Spark functions needed:-

  • flatMap
  • first
  • take

Hint:- A flatMap-ed RDD might not behave like map or filter. Calling an action directly on such a flatMap-ed RDD will throw an error. Can you guess why? If you have already run into this and solved it, congrats!
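
A possible sketch, reusing numRDD from Question 3; the key detail is that flatMap expects the supplied function to return an iterable:

In [ ]:
# One way to answer Question 4.
# flatMap expects the function to return an iterable; returning a bare number
# (e.g. lambda x: x * x) makes the action fail, so wrap the result in a list.
squaredRDD = numRDD.flatMap(lambda x: [x * x])
print(squaredRDD.first())    # first element
print(squaredRDD.take(10))   # first 10 elements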

Question 5

  • There are two ways to express the logic inside a transformation: pass a named function, or use an inline (anonymous) function
  • Try learning both ways
  • Combine Questions 3 and 4: filter all negative values, square them, and print the first 20 of them.

Spark functions needed:-

  • filter
  • use either map or flatMap
  • take
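
A possible sketch showing both styles (a named function and an inline lambda), again reusing numRDD:

In [ ]:
# One way to answer Question 5.
def square(x):          # named function passed to map
    return x * x

negSquares = numRDD.filter(lambda x: x < 0).map(square)
# inline equivalent: numRDD.filter(lambda x: x < 0).map(lambda x: x * x)
print(negSquares.take(20))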

Question 6

  • Filter all positive values from numRDD into "filteredRDD"
  • Now reduce the filteredRDD, finding the sum of all the positive values
  • Also collect the filteredRDD and find the sum of the collected result
  • Compare the time taken by the native sum function on the collected data against Spark's reduce. Did you find any difference? No?
  • Okay, now generate a million (1,000,000) random numbers, store them in numRDD, and repeat the above four steps
  • If you still don't feel the significance, try increasing the count to 10 million and repeat the first four steps

Spark functions needed:-

  • reduce
  • collect
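
A possible sketch of the comparison, using Python's time module for rough timings (the Web UI mentioned in the hint below gives nicer numbers); start with 10,000 numbers and scale up as the question suggests:

In [ ]:
# One way to answer Question 6.
import random, time
numbers = [random.randint(-1000, 1000) for _ in range(10000)]  # raise to 1 or 10 million later
numRDD = sc.parallelize(numbers)
filteredRDD = numRDD.filter(lambda x: x > 0)

t0 = time.time()
sparkSum = filteredRDD.reduce(lambda a, b: a + b)  # summed by Spark
t1 = time.time()
nativeSum = sum(filteredRDD.collect())             # collected first, then summed on the driver
t2 = time.time()
print(sparkSum, t1 - t0, nativeSum, t2 - t1)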

Note:

If your machine has less than 8 GB of RAM or other heavy processes running, stop before 1 million. Try this when no other processes are running.

Hint:

  • There is an easy and graphical way to watch the time taken to complete the process.
  • Go to your browser, and type localhost:4040 and then hit Enter
  • What you see is Spark Web UI, we will cover this in detail during Cluster programming.
  • Did you notice all your earlier Spark actions listed there? :)

Question 7

  • This is an easy question
  • Create two RDDs, "num1" and "num2", with 5 numbers each
  • Find the "union" and "intersection" of the two RDDs
  • Told you it's easy :)

Spark functions needed:-

  • union
  • intersection
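
A possible sketch; the numbers are arbitrary, with some overlap so the intersection is non-empty:

In [ ]:
# One way to answer Question 7.
num1 = sc.parallelize([1, 2, 3, 4, 5])
num2 = sc.parallelize([4, 5, 6, 7, 8])
print(num1.union(num2).collect())         # all elements from both RDDs
print(num1.intersection(num2).collect())  # elements present in both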

Question 8

  • This is also a simple one :)
  • The RDD sets 1 and 2 are given below. Try all the Spark operations listed on each set (a sketch follows the two sets).

Spark functions needed:-

  • join
  • leftOuterJoin
  • rightOuterJoin
  • cartesian

RDD set 1:-


In [ ]:
# combining with just values
rdd1 =  sc.parallelize([("foo", 1), ("bar", 2), ("baz", 3)])
rdd2 =  sc.parallelize([("foo", 4), ("bar", 5), ("bar", 6)])

RDD set 2:-


In [ ]:
# combining with just list items
words1 = sc.parallelize(["Hello", "Human"])
words2 = sc.parallelize(["world", "all", "you", "Mars"])
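
A possible sketch of the listed operations. The three joins expect key/value pairs, so they fit RDD set 1; cartesian needs no keys and works on either set:

In [ ]:
# One way to answer Question 8.
print(rdd1.join(rdd2).collect())            # inner join on matching keys
print(rdd1.leftOuterJoin(rdd2).collect())   # keeps every key from rdd1
print(rdd1.rightOuterJoin(rdd2).collect())  # keeps every key from rdd2
print(words1.cartesian(words2).collect())   # every pairing across the two word RDDs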

Question 9

  • Let's count the words in all the works of William Shakespeare!
  • Load the 'textFile' named 'pg100.txt' from the resources folder into poemRDD
  • Chain all of the following operations on poemRDD:-
    • flatMap poemRDD and split each line into words, using a space (' ') as the delimiter
    • map each word into a tuple with two values: (word, 1)
    • Now reduceByKey and add up all the values. Here the key is the word
    • Collect the result

Spark functions needed:-

  • flatMap
  • map
  • reduceByKey
  • collect
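
A possible sketch of the chained word count, assuming the shell's sc:

In [ ]:
# One way to answer Question 9.
poemRDD = sc.textFile('resources/pg100.txt')
wordCounts = (poemRDD
              .flatMap(lambda line: line.split(' '))  # one element per word
              .map(lambda word: (word, 1))            # (word, 1) pairs
              .reduceByKey(lambda a, b: a + b)        # sum the 1s per word
              .collect())
print(wordCounts[:10])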

Self Exercise:-

Think about displaying the words with the highest counts in descending order.

3. Config

Question 10

  • Try to find all the configuration settings of the default shell SparkContext. Which function should you use?
  • To create a new config, stop the running SparkContext first
  • Now create a new SparkConf object with an app name and a master
  • Check whether "spark.dynamicAllocation.enabled" is true; if it is false, execute the next step
  • Set "spark.dynamicAllocation.enabled" to "true" in the conf
  • Pass the newly created conf as a parameter to the SparkContext constructor and assign the new SparkContext to sc (a sketch follows the hints below)

Hint:-

For finding the existing shell sc configuration,

http://stackoverflow.com/questions/30560241/is-it-possible-to-get-the-current-spark-context-settings-in-pyspark

Finding "spark.dynamicAllocation.enabled" status,

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-dynamic-allocation.html

For setting dynamicAllocation,

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkConf.html
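
A possible Python sketch; the app name and master value are only examples, and in local mode the dynamic allocation setting is merely recorded in the conf rather than actually used:

In [ ]:
# One way to answer Question 10.
from pyspark import SparkConf, SparkContext

print(sc.getConf().getAll())   # configuration of the current shell context
sc.stop()                      # stop the running SparkContext first

conf = SparkConf().setAppName('week2-exercise').setMaster('local[*]')
if conf.get('spark.dynamicAllocation.enabled', 'false') != 'true':
    conf.set('spark.dynamicAllocation.enabled', 'true')
sc = SparkContext(conf=conf)   # new context built from the new conf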

Note:-

So, did you get curious about why we set dynamic allocation in the config? Have you found out where it is useful?

4. Partitions

Question 11

  • Now, shall we try partitions with Question 6?
  • The parallelize function takes two parameters: the first is the data object, the second is 'numSlices'
  • Set the numSlices value to at most your number of CPU cores
  • Now observe the time difference when executing steps 1 to 4 of Question 6
  • If feasible, try with 1 to 10 million random numbers
  • Can you see that finding the optimum 'numSlices' value is itself an optimization problem?
  • Try using 'getNumPartitions' on any of the RDDs before and after increasing the number of partitions

Spark functions needed:-

  • parallelize
  • reduce
  • collect
  • getNumPartitions
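
A possible sketch; numSlices=4 is only an example, pick a value no larger than your core count:

In [ ]:
# One way to answer Question 11.
import random
numbers = [random.randint(-1000, 1000) for _ in range(1000000)]
numRDD = sc.parallelize(numbers, numSlices=4)  # the second parameter controls the partition count
print(numRDD.getNumPartitions())               # verify it
print(numRDD.filter(lambda x: x > 0).reduce(lambda a, b: a + b))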

Question 12

  • Another easy one; this is just a copy/paste of Question 9
  • Just provide an additional argument 'minPartitions' to the textFile method and observe the time difference in evaluation
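
A possible sketch; minPartitions=8 is only an example value:

In [ ]:
# One way to answer Question 12.
poemRDD = sc.textFile('resources/pg100.txt', minPartitions=8)
print(poemRDD.getNumPartitions())
wordCounts = (poemRDD.flatMap(lambda line: line.split(' '))
                     .map(lambda word: (word, 1))
                     .reduceByKey(lambda a, b: a + b)
                     .collect())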

5. Spark SQL

Spark Session

  • We are going to create a Spark Session. Why is this different from SparkContext and SparkConf?
  • Store the Spark Session object in a variable called 'spark' (a sketch follows the hint below)

Hint :

https://docs.databricks.com/spark/latest/gentle-introduction/sparksession.html
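
A possible sketch; note that recent pyspark shells already expose a session as spark, in which case getOrCreate() simply returns it. The app name is only an example:

In [ ]:
# Creating (or reusing) a SparkSession.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('week2-sql')
         .getOrCreate())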

Question 13

  • Loading data from JSON
  • Using the 'spark' variable, read 'people.json' from the resources folder and store it in a variable called 'df'
  • 'df' is a DataFrame; view its contents using the "show" method
  • Use the 'select' function on 'df' to choose a single column and display it using the 'show' method
  • Use the 'filter' function on 'df' to apply a condition like df['age'] > 18 and display the results using 'show'
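
A possible sketch, assuming the file is people.json and that it contains name and age columns (as in the standard Spark example data):

In [ ]:
# One way to answer Question 13.
df = spark.read.json('resources/people.json')
df.show()                         # print the DataFrame contents
df.select('name').show()          # a single column
df.filter(df['age'] > 18).show()  # rows where age is greater than 18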

Self Exercise:-

  • Learn to create Temp View
  • Learn to create Global Temp View
  • Find the difference between a global temporary view and a temporary view, and check it with an example.
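
A possible sketch for the self exercise; the view names are arbitrary. Global temporary views live in the global_temp database and are shared across sessions of the same application, while plain temporary views are tied to the session that created them:

In [ ]:
# Temporary view vs. global temporary view.
df.createOrReplaceTempView('people')          # visible only in this SparkSession
df.createGlobalTempView('people_global')      # shared across sessions of the same application
spark.sql('SELECT * FROM people').show()
spark.sql('SELECT * FROM global_temp.people_global').show()
spark.newSession().sql('SELECT * FROM global_temp.people_global').show()  # reachable from a new session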

For More SQL Exercises Refer,

For Python, https://github.com/apache/spark/tree/master/examples/src/main/python/sql

For Scala, https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/sql


Question 14 - TODO Week 3

  • Loading data from text files